Neural Computation - Coursework 2

Shinichi Tsukada, 2123193

Introduction

Semantic segmentation is a form of image annotation that assigns a label to each pixel of an image. It has been widely applied in practice, for example in autonomous driving and medical image processing. The main purpose of this experiment is to build a Convolutional Neural Network (CNN) that can segment cardiovascular magnetic resonance (CMR) images properly; semantic segmentation is the right algorithmic approach to this aim.

From 2015 to 2020, many segmentation models have shown excellent performance on MRI segmentation, most of them based on the U-Net family. In this experiment, we first review the state-of-the-art U-Net family models, using U-Net as an example for tuning the learning rate and the optimisation function. We then compare four selected models with our implemented evaluation function on the validation data and select the best one for testing.

Review

According to the review papers [1, 11], many network architectures were developed after 2015, following the outstanding 2D U-Net [10], which performed well on medical image segmentation problems. The U-Net idea later generated models for 3D structures, such as V-Net [9] and 3D U-Net [15], in 2016 and 2017. In 2018, UNet++ [13] was published, adding dense skip pathways in the middle to help the optimiser train more efficiently, since the traditional U-Net has a large semantic gap between its encoder and decoder layers. Many other models similarly modify the shape of the layers and add different skip connections, such as MDU-Net [12], LadderNet [14] and nnU-Net [6]. In 2019, MultiResUNet [4] was implemented in Keras, gradually increasing the number of filters and adding residual connections to achieve good performance. In 2020, UNet 3+ introduced full-scale skip connections between encoders and decoders to achieve higher scores on multi-scale organs.

In this experiment, we choose three U-Net family models, U-Net, UNet++ and UNet 3+, with a one-layer U-Net model for comparison. The models were trained and validated, and we then chose the more accurate and efficient model based on their validation scores.

Implementation

Dataset

image.png

Figure 1: The left is a CMR image, while the right is a mask image of the segmented left image. There are four classes for the segmented mask: Class 0 (black) is the background region. Class 1 (dark grey) is the right ventricle (RV) region. Class 2 (white grey) is the myocardium (Myo) region. Class 3 (white) is the left ventricle (LV) region.

CMR imaging is an essential way to detect many cardiovascular diseases by analysing cardiac chamber volume and mass. Although clinicians have completed this task manually for decades, it is time-consuming work with the possibility of subjective errors. Hence, building a CNN which can segment magnetic resonance images automatically would contribute greatly to the medical field.

We have prepared 200 CMR images in PNG format, split into 100 training images, 20 validation images and 80 test images. There are 100 and 20 ground-truth mask images for the training and validation data, respectively. The model was therefore trained on the training inputs and their true masks, and we tuned hyperparameters using the validation inputs and their true masks. The shape of the input images is (B, 1, 96, 96), i.e. (Batch size, Channel, Height, Width), while that of the output is (B, 4, 96, 96) because there are four classes, namely white (LV), white grey (Myo), dark grey (RV) and black (the background).

Data Preprocessor

For preparing the training and validation datasets, following the PyTorch instructions, we generated our training_data_loader and valid_data_loader. The train_data_path and valid_data_path denote the locations of the data. Since this work runs on Colab through Google Drive, please change the following data paths to your own locations.

We use the best model for testing, so we set model_data_path to the location where the best models, selected by validation loss, are saved. The writer saves information during training and validation for diagram generation and testing.
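As a sketch of this setup, assuming the document's variable names (the paths and placeholder tensors below are illustrative, not the actual dataset-loading code):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

train_data_path = "/content/drive/MyDrive/data/train"   # change to your location
valid_data_path = "/content/drive/MyDrive/data/valid"
model_data_path = "/content/drive/MyDrive/models"       # best checkpoints saved here

# Placeholder tensors stand in for the 100 training and 20 validation
# 96x96 CMR images and their integer class masks (classes 0-3).
train_set = TensorDataset(torch.zeros(100, 96, 96),
                          torch.zeros(100, 96, 96, dtype=torch.long))
valid_set = TensorDataset(torch.zeros(20, 96, 96),
                          torch.zeros(20, 96, 96, dtype=torch.long))

training_data_loader = DataLoader(train_set, batch_size=4, shuffle=True)
valid_data_loader = DataLoader(valid_set, batch_size=4, shuffle=False)
```

In the actual notebook the PNG files are read from the paths above and the writer (a TensorBoard SummaryWriter) records the scalars produced during training and validation.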

Network Architecture

First Implementation

In this experiment, we have produced a model from scratch, which deepened our understanding of convolutional neural networks. We have also implemented several state-of-the-art models in order to compare their performance and understand cutting-edge designs. Our model is shown in Figure 2:

image.png

Figure 2: The model which we have built by ourselves. Zero-padding is 1 and stride is 1 for each convolutional layer. The filter sizes of max pooling and max unpooling are 2x2 and their stride is 2. The activation function in all convolutional layers is the ReLU function. All weights are initialised by He initialisation. Batch normalisation is applied between convolutional layers. Skip connections are used for some layers (grey arrows).

In terms of our model, we incorporated four max-pooling layers in the network so as to extract important features, followed by four max-unpooling layers which produce images of the same size as the inputs while maintaining those extracted features. However, deep networks like this tend to lose spatial information. To ease this issue, we used skip connections [3], which merge the identity mapping with the outputs of the stacked layers. Additionally, in deep learning the distribution of the input values changes at each layer, which makes training less efficient; we dealt with this problem using batch normalisation [5], which normalises the distribution over each training mini-batch before the activation function. Furthermore, the initial weights are determined by He initialisation [2], which draws values from a normal distribution with mean 0 and variance $\frac{2}{n_i}$, where $n_i$ is the number of units in the $i$-th layer. For example, $n_i$ is 9216 when $i = 1$. A notable role of these techniques is to alleviate the vanishing/exploding gradient problem. In the part of the experiment that evaluates the effect of skip connections and batch normalisation, we set the learning rate to 0.001, with Adam as the optimiser and cross-entropy loss as the cost function. The main model on which we tuned hyperparameters is U-Net, which also includes skip connections and batch normalisation, so we can understand the behaviour of this model deeply.
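A minimal PyTorch sketch of one such building block, combining convolution, batch normalisation before the ReLU, and He initialisation; the class name ConvBlock, the channel sizes and the zero decoder stand-in are illustrative assumptions:

```python
import torch
import torch.nn as nn

class ConvBlock(nn.Module):
    """Two 3x3 convolutions (stride 1, zero-padding 1), each followed by
    batch normalisation and a ReLU activation, as in Figure 2."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, kernel_size=3, stride=1, padding=1),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, kernel_size=3, stride=1, padding=1),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
        )
        # He initialisation: weights ~ N(0, 2 / n_i), suited to ReLU units.
        for m in self.modules():
            if isinstance(m, nn.Conv2d):
                nn.init.kaiming_normal_(m.weight, nonlinearity="relu")

    def forward(self, x):
        return self.block(x)

x = torch.randn(2, 1, 96, 96)       # a mini-batch of CMR images
enc = ConvBlock(1, 64)(x)           # spatial size is preserved: (2, 64, 96, 96)
dec = torch.zeros_like(enc)         # stand-in for decoder output at the same scale
out = enc + dec                     # skip connection: identity merged with stacked output
```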

U-Net

In order to produce our most accurate attempt at the image segmentation problem, a U-Net architecture was also implemented. Ronneberger et al. suggest that this architecture, originally designed for biomedical image segmentation, works effectively even when only a small number of training examples is available, as is the case in this task. Using the implementation from milesial [8], the U-Net was able to produce a score of 0.8945 in the public ranking on Kaggle. The architecture of U-Net is described in the image below.

image.png

Figure 3: U-Net Architecture

The architecture is U-shaped and symmetric and consists of two main parts. The left side implements a general convolutional network design using the ReLU activation function; this aspect is called the contracting path. In this section each stage contains two convolutional layers, which increase the depth of the feature maps, while the pooling layers halve the spatial size each time. The right-hand side is referred to as the expansive path. The expansive path differs from other convolutional neural networks: it is responsible for 'upsizing' the parsed image back to its original size, using transposed convolutional layers. This technique is not the same as a deconvolution, the operation that reverts a convolution. A transposed convolution performs a convolution but restores the image to its original spatial form, which is useful as it performs the upscaling and a convolution in a single operation. After the image is upsized, it is concatenated with its counterpart from the contracting path; this yields a more accurate prediction.
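The expansive-path step described above can be sketched in PyTorch as follows; the shapes are chosen to match the 96x96 inputs of this task, and the variable names are illustrative:

```python
import torch
import torch.nn as nn

# One step of the expansive path: a 2x2 transposed convolution with stride 2
# doubles the spatial size, and the result is concatenated with the matching
# contracting-path feature map along the channel axis.
up = nn.ConvTranspose2d(128, 64, kernel_size=2, stride=2)

decoder_in = torch.randn(1, 128, 48, 48)   # output of the previous decoder stage
encoder_skip = torch.randn(1, 64, 96, 96)  # counterpart from the contracting path

upsized = up(decoder_in)                   # (1, 64, 96, 96): back to full resolution
merged = torch.cat([encoder_skip, upsized], dim=1)  # (1, 128, 96, 96)
```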

The following U-Net network refers to the GitHub repository:

https://github.com/milesial/Pytorch-UNet

U-Net++ and U-Net 3+

For further comparison, we set up three extra models in the U-Net family: a one-layer U-Net, U-Net++ and U-Net 3+.

The one-layer U-Net serves as a comparison to U-Net and as a way of preventing overfitting. The U-Net++ and U-Net 3+ implementations refer to GitHub:

U-Net++: https://github.com/4uiiurz1/pytorch-nested-unet

U-Net 3+: https://github.com/ZJUGiveLab/UNet-Version

image-2.png

Figure 4: The Network Structure of the U-Net Family

We list the model of U-Net++ below:

Training Function

During the training part, we first move our data to the CUDA GPU with .cuda(). The training loss was calculated by AverageMeter() to obtain the average score of each epoch (for example, about 25 iterations with a batch size of 4).

After each pass through the network, we update the writer's 'train_loss' scalar and save the log files to a local address.

Note that one channel was added through the unsqueeze(1) function, since the data structure that the CNN requires is [N, C, H, W] while our input is [N, H, W].

The code of the train function is shown below:
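A minimal sketch of such a train function is given below; the AverageMeter class and the function signature are assumptions based on the description above, not the exact coursework code:

```python
import torch
import torch.nn as nn

class AverageMeter:
    """Tracks the running average of a scalar (here, the batch loss)."""
    def __init__(self):
        self.sum, self.count = 0.0, 0
    def update(self, val, n=1):
        self.sum += val * n
        self.count += n
    @property
    def avg(self):
        return self.sum / max(self.count, 1)

def train_one_epoch(model, loader, optimiser, criterion, device="cpu"):
    model.train()
    training_loss = AverageMeter()
    for images, masks in loader:
        # CNNs expect [N, C, H, W]; the raw input is [N, H, W],
        # so a channel dimension is added with unsqueeze(1).
        images = images.unsqueeze(1).float().to(device)  # .cuda() on a GPU
        masks = masks.long().to(device)
        optimiser.zero_grad()
        loss = criterion(model(images), masks)
        loss.backward()
        optimiser.step()
        training_loss.update(loss.item(), images.size(0))
    return training_loss.avg   # average loss of this epoch, logged by the writer
```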

Validation Function

Dice Scores
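As a sketch, the mean Dice coefficient over the four classes for integer masks can be computed as follows; the function name and the smoothing term eps are illustrative assumptions:

```python
import torch

def dice_score(pred, target, n_class=4, eps=1e-6):
    """Mean Dice coefficient over the classes for integer label masks.
    For each class c: 2 * |pred_c AND target_c| / (|pred_c| + |target_c|)."""
    scores = []
    for c in range(n_class):
        p = (pred == c).float()
        t = (target == c).float()
        inter = (p * t).sum()
        # eps avoids division by zero when a class is absent from both masks
        scores.append(((2 * inter + eps) / (p.sum() + t.sum() + eps)).item())
    return sum(scores) / n_class
```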

Other validation scores used in the experiment part

To compare the chosen models, we developed a way to generate the key information that describes model performance. The first and most important is the pixel accuracy, the number of correctly predicted pixels over the total number of pixels in the images:

$$PA=\frac{\sum_{i=0}^kp_{ii}}{\sum_{i=0}^k\sum_{j=0}^kp_{ij}}$$

The other important validation metric is the proportion of each class that is classified correctly; the average score is calculated by:

$$MPA=\frac{1}{k+1}\sum_{i=0}^{k}\frac{p_{ii}}{\sum_{j=0}^kp_{ij}}$$

In semantic segmentation, the mean ratio between the intersection and union of the ground truth and the predicted segmentation, i.e. the true values in the overlap of ground truth and predictions, denotes the accuracy for each class:

$$MIoU=\frac{1}{k+1}\sum_{i=0}^k\frac{p_{ii}}{\sum_{j=0}^k p_{ij}+\sum_{j=0}^kp_{ji}-p_{ii}}$$

In our implementation, PA, MPA and MIoU (corresponding to acc, acc_cls and mean_iu in our code) are calculated by the evaluation function during the validation process. The returned scores are recorded by the Writer for later use.
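A confusion-matrix-based evaluation function matching the three formulas can be sketched as follows; it is consistent with the acc, acc_cls and mean_iu names above, but not necessarily the exact coursework code:

```python
import numpy as np

def evaluate(label_trues, label_preds, n_class=4):
    """Compute PA (acc), MPA (acc_cls) and MIoU (mean_iu) from the
    confusion matrix p_ij of true class i vs. predicted class j."""
    hist = np.zeros((n_class, n_class))
    for lt, lp in zip(label_trues, label_preds):
        mask = (lt >= 0) & (lt < n_class)
        hist += np.bincount(
            n_class * lt[mask].astype(int) + lp[mask],
            minlength=n_class ** 2,
        ).reshape(n_class, n_class)
    acc = np.diag(hist).sum() / hist.sum()                  # PA
    acc_cls = np.nanmean(np.diag(hist) / hist.sum(axis=1))  # MPA
    iu = np.diag(hist) / (hist.sum(axis=1) + hist.sum(axis=0) - np.diag(hist))
    mean_iu = np.nanmean(iu)                                # MIoU
    return acc, acc_cls, mean_iu
```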

We also printed the validation scores every 4 batches during the validation process, to help us detect overfitting and underfitting in the models.

Training and Validation Process

Throughout the training process, we use ReduceLROnPlateau to tune the learning rate dynamically according to the validation loss.

For better comparison, all of the models were trained with the same optimiser and loss function:

Adam optimiser with a learning rate of 1e-5, reduced after a patience of 10 epochs

CrossEntropyLoss with the PyTorch default settings
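This setup can be sketched as follows; the stand-in model is illustrative:

```python
import torch
import torch.nn as nn

model = nn.Conv2d(1, 4, kernel_size=1)  # stand-in for the U-Net being trained

criterion = nn.CrossEntropyLoss()       # PyTorch default settings
optimiser = torch.optim.Adam(model.parameters(), lr=1e-5)
# Reduce the learning rate when the validation loss plateaus for 10 epochs.
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimiser, mode="min", patience=10)

# After validation in each epoch: scheduler.step(val_loss)
```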

The best model was saved for testing through the following code:

We compare the current validation loss with the best_val_loss held in the buffer. If the current validation loss is lower than the previous best, we save the model to the address PATH.

Please change PATH to your own folder when you run the following code.
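A minimal sketch of this checkpointing logic; the helper name save_if_best is an assumption:

```python
import torch
import torch.nn as nn

PATH = "best_model.pth"      # change to your folder
best_val_loss = float("inf")

model = nn.Conv2d(1, 4, kernel_size=1)  # stand-in for the network being trained

def save_if_best(model, val_loss):
    """Keep only the checkpoint with the lowest validation loss so far."""
    global best_val_loss
    if val_loss < best_val_loss:
        best_val_loss = val_loss
        torch.save(model.state_dict(), PATH)

save_if_best(model, 0.5)   # 0.5 < inf, so the checkpoint is written
save_if_best(model, 0.7)   # worse loss: the checkpoint is left unchanged
```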

U-Net Training Process

Final epochs are shown below

image.png

U-Net++ Training Process

Final epochs are shown below

image.png

U-Net 3+ Training Process

Please note that the U-Net 3+ network was copied directly from GitHub with few changes, as mentioned above; you may therefore need to import U-Net 3+ first to run the following code.

Final epochs are shown below

image.png

Experiment

Model Selection

We use TensorBoard to show the curves saved by the Writer; the results are shown below.

image-6.png

image-7.png

image-5.png

Table 1: The Training Results

From the figures above, we can see that the other models converged in around 150 epochs, while U-Net++ took an extra long time to train.

Since the best model was saved according to validation loss, the number of epochs needed to reach the final best model over the whole process is recorded as best_epoch.

Comparing training_loss against val_loss checks whether a model is overfitting or underfitting; Figure (a) of each network shows the result. For most models the validation loss is slightly higher than the training loss, but they all converged at an acceptable level, so overfitting did not happen.

The validation accuracy tests the models' performance on validation data: the two U-Net models reach 94%, while the improved UNet++ and UNet 3+ reach 96%. This also shows in the acc_cls figure, where the improved UNet++ and UNet 3+ reached above 90% but the classical U-Net only reached 80%. However, the single-layer U-Net achieved relatively higher acc_cls scores than U-Net. A possible reason is that, given the small dataset we were provided, the deeper network may begin to over-train early; in some cases, the simpler network obtains higher scores instead.

The training and validation results are shown in Table 1. From these results, the best model is U-Net++ with a val_acc of 0.9604, compared with the classical U-Net, which reached 0.9441. However, U-Net++ takes 461 epochs to reach its best model while U-Net only needs around 100. Considering the limitation of our computing resources, the classical U-Net is more approachable and efficient for us, with relatively high scores. We therefore chose U-Net as our tuning and testing model.

Activation Function

The default activation function implemented in the U-Net CNN architecture is the ReLU function. This is a common choice as it performs consistently well over many different architectures. This section investigates four alternatives to ReLU: Leaky ReLU, GELU, Tanh and Sigmoid. The experiments kept the other hyperparameters, such as the loss function, optimiser, mini-batch size, number of epochs, and others discussed in this report, the same in order to isolate the impact of the activation function.

image.png

Table 2: Table of Dice Scores Corresponding to Different Activation Functions

The Leaky ReLU has a similar construction to the ReLU activation function, but it differs in the way it treats negative values. The ReLU activation function sets all negative values to zero, whereas the Leaky ReLU allows a slightly decreasing slope, with the aim of reducing the number of 'dead' neurons produced by the ReLU function. With respect to the validation loss, shown in Fig. 9, the ReLU and Leaky ReLU functions behaved very similarly, both almost fully converging after 20 epochs; the only slight difference was a slightly faster initial drop in validation loss for Leaky ReLU. However, the ReLU activation function still outperformed the Leaky ReLU function by a small margin. The table above shows the dice score for both functions, with ReLU achieving 0.860 and Leaky ReLU achieving 0.855.

GELU stands for Gaussian Error Linear Unit. The function is formed by multiplying the inputs by the cumulative Gaussian function, and the resulting form resembles a 'smoother' ReLU: it discounts large negative values but still passes small negative ones, while above-zero inputs behave the same as in ReLU. Similarly to ReLU and Leaky ReLU, the GELU function almost completely converges after 20 epochs and, predictably, follows a very similar validation-loss path in Fig. 9. However, the GELU function performed worse than ReLU, with a dice score of 0.841 compared to 0.860.

The Tanh activation function differs significantly from the ReLU activation function: the outputs are zero-centred in the range -1 to 1. The function failed to converge properly in 20 epochs and, as seen in Fig. 9, showed a much slower decrease in validation loss than the activation functions above. The Tanh function performed significantly worse than ReLU, with a dice score of 0.718 compared to 0.860. If the model were run for a much larger number of epochs, however, the performance would improve considerably.

The Sigmoid activation function suffered similarly to the Tanh function. It failed to converge, as shown in Fig. 9, and its dice score was therefore significantly lower than the ReLU function's, at 0.764 compared to 0.860.

Therefore, as the results in the table above show, there is no evidence to suggest that any of the alternative activation functions experimented with would improve performance over the ReLU activation function.

image-2.png

Figure 9: Activation Function Experiments

Optimiser

To improve the attributes of our neural network, such as the weights and learning rates, and to better reduce the network losses, three optimisation algorithms were explored in training our model: Stochastic Gradient Descent (SGD), Adam and Adagrad. SGD updates the model's parameters more frequently: the model is altered after computing the loss on a random training data point. The frequent updates increase the speed of convergence, reducing the training time. Adagrad, by contrast, adapts the learning rate for each parameter at each time step, so the learning rate is appropriately selected. Adam aims to decay the loss more gradually without jumping over the minimum; it achieves this by storing a decaying average of past squared gradients and computing the first and second moments before updating each parameter. This method is therefore fast and converges more rapidly.

image.png

Table 3: Table of accuracy corresponding to Optimization algorithms

To adequately evaluate our training results for each optimisation function, we used the same parameter initialisation when comparing the different algorithms. The hyperparameters, such as learning rate, number of epochs and mini-batch size, were tuned to find the optimal values. The evidence obtained shows that the Adam optimiser performed better than SGD and Adagrad in training our model: although SGD and Adagrad make rapid progress in lowering the cost in the initial stage of training, the Adam optimiser converges considerably faster overall. We also noticed that controlling the mini-batch variance was very significant in our model and contributed to speeding up the optimiser. Though Adam shows only marginal improvements over SGD, it adapts the learning-rate scale for the different layers instead of this being hand-picked as in SGD.

Loss Function

The trainable parameters in our network are updated with respect to the loss function, so a relevant loss function is crucial for improving accuracy. The loss functions we chose to investigate in this model are cross-entropy, focal loss and soft Dice loss. Cross-entropy is part of the PyTorch package, whereas the focal loss and soft Dice loss are implemented as described in the Dense Object Detection paper [7]. To validate the effects of the different loss functions on training and model performance, we kept the hyperparameters constant while investigating the performance of all three loss functions.
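Hedged sketches of the two custom losses, consistent with the description above; the exact reduction and smoothing choices (mean reduction, gamma = 2, eps) are assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FocalLoss(nn.Module):
    """Focal loss [7]: down-weights well-classified pixels by (1 - p_t)^gamma."""
    def __init__(self, gamma=2.0):
        super().__init__()
        self.gamma = gamma

    def forward(self, logits, target):
        ce = F.cross_entropy(logits, target, reduction="none")  # per-pixel CE
        pt = torch.exp(-ce)                 # probability assigned to the true class
        return ((1 - pt) ** self.gamma * ce).mean()

class SoftDiceLoss(nn.Module):
    """Soft Dice loss: 1 minus the mean Dice coefficient over the classes,
    computed on the softmax probabilities so it stays differentiable."""
    def __init__(self, eps=1e-6):
        super().__init__()
        self.eps = eps

    def forward(self, logits, target):
        probs = F.softmax(logits, dim=1)
        one_hot = F.one_hot(target, logits.size(1)).permute(0, 3, 1, 2).float()
        dims = (0, 2, 3)                    # sum over batch and spatial dims
        inter = (probs * one_hot).sum(dims)
        union = probs.sum(dims) + one_hot.sum(dims)
        dice = (2 * inter + self.eps) / (union + self.eps)
        return 1 - dice.mean()
```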

image.png

Table 4: Table of Accuracy corresponding to loss function methods

image-2.png

Figure 10: Figure Showing Loss Functions Converge Corresponding to Number of Epoch

The performance of the model trained with the three loss functions is described in Table 4. The focal loss slightly outperformed cross-entropy and soft Dice; nevertheless, the differences between the three loss functions were negligible and could have been caused by fluctuations during model training. Figure 10 shows the three loss functions converging over the number of epochs: the focal loss converges more rapidly than the soft Dice loss, while cross-entropy shows less oscillation and converges more smoothly. The results also suggest that choosing the optimal loss function can improve performance.

Furthermore, by comparing our loss functions and their implementations, we selected cross-entropy as our model's loss function; one reason is that cross-entropy proved faster in training our model without compromising test accuracy. However, in the case of class imbalance, other loss functions such as soft Dice loss are better options for training the model.

Learning Rate

In order to investigate the impact the learning rate has on model accuracy, the other hyperparameters were fixed, with the optimiser set to Adam and the number of epochs at 15. The model was run at various learning rates from 1.0 down to $10^{-7}$. A learning rate of 0.001 was seen to be optimal, producing a dice score of 0.857; increasing the learning rate beyond this point caused the dice score to decrease. The table below shows the learning rates experimented with and the corresponding dice scores, and a graph of learning rate against dice score is also included, from which the effect of increasing or decreasing the learning rate can be seen.
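The sweep can be organised as in the following skeleton, where the training and evaluation step is elided and the recorded scores are placeholders:

```python
import torch
import torch.nn as nn

# Every other hyperparameter fixed (Adam, 15 epochs); only the learning
# rate varies, from 1.0 down to 1e-7 in factors of ten.
learning_rates = [10 ** e for e in range(0, -8, -1)]

results = {}
for lr in learning_rates:
    model = nn.Conv2d(1, 4, kernel_size=1)  # stand-in for a freshly initialised U-Net
    optimiser = torch.optim.Adam(model.parameters(), lr=lr)
    # ... train for 15 epochs and evaluate on the validation data ...
    results[lr] = None  # the validation dice score would be recorded here
```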

image.png

Table 5: Table of Dice Score corresponding to Learning Rate

image-2.png

Figure 11: A graph of the log learning rate against the dice score. The x-axis represents the log of the learning rate, where -7, -6, -5, -4 etc. represent $10^{-7}$, $10^{-6}$, $10^{-5}$, $10^{-4}$.

Training Time

To optimise the training time, we tuned the number of epochs used in the model; other hyperparameters such as batch size, learning rate and choice of optimiser were kept fixed. By looking at the dice score and the run time, we conclude that a suitable number of epochs is 15. Increasing the number of epochs increases the run time, but the run time for 15 epochs was around 18 minutes, which is reasonable. A low number of epochs underfits the data, so picking the optimal number of epochs can be difficult. The following table displays the run time and dice score for each number of epochs tested, and a graph is also attached to visualise the results.

image.png

Table 6: Table of Dice Score and Run Time corresponding to Number of Epochs

image-2.png

Figure 12: A graph of the Number of Epochs against the Dice Score

Mini-Batch Size

The mini-batch size refers to the number of training examples, out of the entire training set, on which the model is trained and updated in each iteration. In order to gauge how the mini-batch size impacts the dice score of the model, all hyperparameters apart from mini-batch size were kept constant. The number of training examples available is 100, so the batch sizes tested range from 1 to 100.

image.png

Table 7: Table of Dice Scores Corresponding to Different Mini-Batch Sizes

A mini-batch size of 4 produced the highest dice score, 0.860, as can be seen in the table above. This peak suggests that the optimum batch size lies between 1 and 10: the dice score increases from a batch size of 1 to 4, then decreases at 10. Beyond a mini-batch size of 10, the scores for 25, 50 and 100 continue to decrease as the mini-batch size increases.

These scores match what one would expect. A lower batch size takes longer to run but converges faster than a higher batch size. This can be seen in Fig. 13 below, where the larger batch sizes generally converge at a smoother but slower rate. Therefore, at 20 epochs, 4 is the strongest mini-batch size tested; as the number of epochs increases, however, a slightly higher batch size may become more optimal.

image.png

Figure 13: Mini-Batch Size Experiments

Understanding Skip Connections and Batch Normalisation

In order to understand the behaviour of skip connections and batch normalisation, we evaluated the performance of our model with the skip connections or batch normalisation removed. We trained these models for 60 epochs. The results are shown in Figure 14:

image.png

Figure 14: Cross-entropy loss for the trained models. There are four models: the model trained with both skip connections and batch normalisation (blue trajectory), the model without batch normalisation (green), the model without skip connections (yellow) and the model without both methods (red). The learning rate is 0.001 and Adam is used.

It is obvious that the model with both methods produced the best performance. The dice scores on validation data are 0.8134 for our model, 0.7308 without skip connections, 0.0039 without batch normalisation and 0.0000 without both methods. The unstable distribution of input values caused by the deep network has a huge negative impact on training; additionally, the loss of spatial information disturbs the model's learning. Based on this result, however, batch normalisation and skip connections can overcome these issues, so these methods are essential for improving semantic segmentation models.

Conclusion

In conclusion, we built a CNN from scratch to understand the behaviour of CNNs and the methods that deal with vanishing/exploding gradient problems, namely batch normalisation and skip connections. In addition, we successfully implemented a state-of-the-art CNN architecture, U-Net. After tuning the hyperparameters of the U-Net implementation and of our own CNN architecture to obtain the best possible performance on the image segmentation task, we found that our best model, which produced the highest score on the public Kaggle leaderboard for the assignment, was the U-Net architecture with a score of 0.89445. The other models submitted to Kaggle performed as follows: our own architecture scored 0.86405, U-Net++ scored 0.81319, and the U-Net implementation using focal loss in place of cross-entropy loss scored 0.87972. The focal loss, although it still produced an acceptable score, performed worse than expected given the results of the loss-function experiments. The hyperparameters found to be most successful with the U-Net architecture were: Adam as the optimiser, cross-entropy as the loss function, a learning rate of 0.001, a mini-batch size of 4, a training time of 80 epochs and ReLU as the activation function.

References

[1] Chen Chen, Chen Qin, Huaqi Qiu, Giacomo Tarroni, Jinming Duan, Wenjia Bai, and Daniel Rueckert. Deep learning for cardiac image segmentation: A review. Frontiers in Cardiovascular Medicine, 7, Mar 2020.

[2] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Delving deep into rectifiers: Surpassing human-level performance on imagenet classification. In Proceedings of the IEEE International Conference on Computer Vision, pages 1026–1034, 2015.

[3] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.

[4] Nabil Ibtehaz and M. Sohel Rahman. MultiResUNet: Rethinking the U-Net architecture for multimodal biomedical image segmentation. Neural Networks, 121:74–87, Jan 2020.

[5] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167, 2015.

[6] Fabian Isensee, Jens Petersen, Andre Klein, David Zimmerer, Paul F. Jaeger, Simon Kohl, Jakob Wasserthal, Gregor Koehler, Tobias Norajitra, Sebastian Wirkert, and Klaus H. Maier-Hein. nnU-Net: Self-adapting framework for U-Net-based medical image segmentation, 2018.

[7] Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, and Piotr Dollár. Focal loss for dense object detection. In Proceedings of the IEEE International Conference on Computer Vision, pages 2980–2988, 2017.

[8] Milesial. Pytorch-UNet, GitHub repository, 2020.

[9] Fausto Milletari, Nassir Navab, and Seyed-Ahmad Ahmadi. V-Net: Fully convolutional neural networks for volumetric medical image segmentation. In 2016 Fourth International Conference on 3D Vision (3DV), pages 565–571. IEEE, 2016.

[10] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-net: Convolutional networks for biomedical image segmentation, 2015.

[11] Saeid Asgari Taghanaki, Kumar Abhishek, Joseph Paul Cohen, Julien Cohen-Adad, and Ghassan Hamarneh. Deep semantic segmentation of natural and medical images: A review, 2020.

[12] Jiawei Zhang, Yuzhen Jin, Jilan Xu, Xiaowei Xu, and Yanchun Zhang. Mdu-net: Multi-scale densely connected u-net for biomedical image segmentation, 2018.

[13] Zongwei Zhou, Md Mahfuzur Rahman Siddiquee, Nima Tajbakhsh, and Jianming Liang. Unet++: A nested u-net architecture for medical image segmentation, 2018.

[14] Juntang Zhuang. Laddernet: Multi-path networks based on u-net for medical image segmentation, 2019.

[15] fiOzgfiun Cfi ificek, Ahmed Abdulkadir, Soeren S. Lienkamp, Thomas Brox, and Olaf Ronneberger. 3d u-net: Learning dense volumetric segmentation from sparse annotation, 2016.